Validating hidden Markov models for seabird behavioural inference

Abstract Understanding animal movement and behaviour can aid spatial planning and inform conservation management. However, it is difficult to directly observe behaviours in remote and hostile terrain such as the marine environment. Different underlying states can be identified from telemetry data using hidden Markov models (HMMs). The inferred states are subsequently associated with different behaviours, using ecological knowledge of the species. However, the inferred behaviours are not typically validated due to difficulty obtaining ‘ground truth’ behavioural information. We investigate the accuracy of inferred behaviours by considering a unique data set provided by Joint Nature Conservation Committee. The data consist of simultaneous proxy movement tracks of the boat (defined as visual tracks as birds are followed by eye) and seabird behaviour obtained by observers on the boat. We demonstrate that visual tracking data is suitable for our study. Accuracy of HMMs ranging from 71% to 87% during chick‐rearing and 54% to 70% during incubation was generally insensitive to model choice, even when AIC values varied substantially across different models. Finally, we show that for foraging, a state of primary interest for conservation purposes, identified missed foraging bouts lasted for only a few seconds. We conclude that HMMs fitted to tracking data have the potential to accurately identify important conservation‐relevant behaviours, demonstrated by a comparison in which visual tracking data provide a ‘gold standard’ of manually classified behaviours to validate against. Confidence in using HMMs for behavioural inference should increase as a result of these findings, but future work is needed to assess the generalisability of the results, and we recommend that, wherever feasible, validation data be collected alongside GPS tracking data to validate model performance. This work has important implications for animal conservation, where the size and location of protected areas are often informed by behaviours identified using HMMs fitted to movement data.


| INTRODUC TI ON
Seabirds are key indicators of marine environmental health (Lascelles et al., 2012;Parsons et al., 2008) but are the most threatened and anthropogenically pressured group of birds globally (Croxall et al., 2012).Threats, including invasive species at breeding colonies, climate change, over-fishing and offshore renewable developments, have resulted in a global decline in seabird populations of 70% over the last five decades (Vulcano et al., 2021).
In the United Kingdom, some species of seabirds (e.g.Northern and black-legged kittiwake [Rissa tridactyla]) have continued to decline (JNCC, 2021).Of the 25 seabird species that regularly breed in the United Kingdom, 24 are listed as Red or Amber on the UK's Birds of Conservation Concern (Stanbury et al., 2021).Under the Habitats Directive (EC/92/43) and Birds Directive (EC/79/409), Special Protection Areas (SPA) are established to form the Natura 2000 network, which protects species and habitats (European-Commission et al., 2008).Although SPAs have historically been restricted to small areas focused on seabird breeding colonies, recent extensions and new classifications in the marine environment have expanded the SPA network across the United Kingdom (JNCC, 2020).Seabirds are restricted to central place foraging during the breeding season.Therefore, understanding at-sea behaviour, including characterising important foraging areas, is vital to ensure adequate protection measures are in place to prevent further population decline.
Seabird tracking studies, where individuals are tagged using biologging technology, are an effective way to understand space use and behaviour (Bennison et al., 2018;Lascelles et al., 2012;Wakefield et al., 2017).Technological advances have accelerated the availability of biologging information from devices such as Global Positioning System (GPS) transmitters, accelerometers, conductivity-temperature-depth (CTD) tags and harmonic radar trackers (Cooke et al., 2004).Telemetry data provides information on animal locations at discrete intervals but does not provide direct information about the underlying behaviour of the tagged animals.
To infer behavioural states such as foraging, flying and resting from movement data, hidden Markov Models (HMMs) have been widely used (Langrock et al., 2012;McClintock, 2021;McKellar et al., 2015;Morales et al., 2004;Patterson et al., 2009).HMMs are time series models with observation and state processes where the latent (unobserved) states are often interpreted as activities or behaviours relating to the ecology of the species (Langrock et al., 2012).Ecologically inferred behaviours post-analysis, simplified to inferred behaviours hereafter, can be used to inform conservation decision-making, for example, the size and location of protected areas.
One limitation of using inferred behaviours to inform conservation-relevant decision-making is the difficulty in validating models using ground truth data.Some studies have attempted to validate inferred behaviours from movement data, such as Joo et al. (2013), which validated the behaviour of fishing vessels using ground truth data recorded by onboard observers.Bennison et al. (2018) and Conners et al. (2021) also validated inferred behaviours of northern gannet (Morus bassanus) and albatross using behaviours from depth recorder and sensors as ground truth data respectively.However, depth recorders and sensors are also proxies for ground truth data with their own error structures.Overall, little research has focused on evaluating the performance of HMMs fitted to animal movement data through data validation because contemporaneous behavioural observations on tracked individuals can be challenging to collect, particularly in featureless environments, such as open ocean (Joo et al., 2013).To examine the performance of HMMs fitted to movement data, we consider a unique data set provided by the Joint Nature Conservation Committee (JNCC) and obtained via the visual tracking of terns (Sterna spp.) using a rigidhulled inflatable boat.A visual tracking method developed by Perrow et al. (2011) was conducted at several tern breeding colonies across the United Kingdom during chick-rearing and incubation in different years (Wilson et al., 2014).Proxy movement data, corresponding to the GPS location of the boat, and the observed behavioural data of the terns directly recorded by the observers on the boat were collected at the same time frequency.
First-hand behavioural data of seabirds such as that collected by Wilson et al. (2014) is generally not feasible to collect directly alongside GPS tracking location data.We consider terns as a case study to examine the performance of HMMs for behavioural inference.To the best of our knowledge, this is the first study to validate inferred behaviours from movement data using observed behavioural data of seabirds.Our study aims to leverage the rare opportunity provided by the unique JNCC data set to (i) examine whether boat locational data are an adequate proxy of tern movement data in relation to inferred behaviours and (ii) validate inferred behaviours of seabirds from HMMs fitted to boat locational data using behavioural observations of seabirds taken by the observes from the boat.

| Study species and sites
This study investigates the movement behaviour of four tern (Sterna spp.) species: Arctic (Sterna paradisaea), common (S.hirundo), Sandwich (S. sandvicensis) and roseate (S. dougallii).Arctic with some pairs occasionally breeding in south-east Scotland, Norfolk and Hampshire (Wilson et al., 2014).Study sites comprised of nine breeding colonies across the United Kingdom (Figure 1): Blue Circle ( 54• 49 ′ N, 5 • 46 ′ W) and Cockle Island (54 ) in Northern Ireland; Cemlyn Bay (53 Wales; Glas-Eileanan Island (56 and South Shian (56 Terns are ground-nesting colonial breeders, raising one brood each breeding season (May-June) and laying a clutch of one to three eggs.While breeding adult terns are central place foragers throughout the breeding season, they are particularly restricted during chick-rearing when they must return regularly to provision their chicks, and adults spend up to 80% of their time foraging (Thaxter et al., 2012).Sandwich terns are specialist predators that can exploit clupeids and sandeels from deeper water, potentially due to their wider foraging range.Likewise, roseate terns are specialists who also forage by plunge diving to depth, catching prey items of predominately sandeels, herring and sprat.Common terns are generalist predators and prey items include invertebrates, clupeids, sandeels and gadoids.Arctic terns forage using several techniques but are heavily dependent on sandeel and changes in prey availability can affect their breeding success (Eglington & Perrow, 2014).

| Visual tracking data
Visual tracking data were collected using a technique developed by Perrow et al. (2011) and detailed in Wilson et al. (2014).We An ethogram of continuous flight behaviours and instantaneous foraging events was provided to each observer, and the timing of each behaviour was recorded (Wilson et al., 2014).Flight behaviours were categorised as active search, transit search and direct flight.
Direct flight was defined as a clear and consistent direction with fast flight usually returning to the colony with food.An active search was defined as an erratic flight course actively searching for food, which may include instances of diving and surface feeding.It is hypothesised that for a direct flight, terns have a fixed location in view and fly in a clear and consistent direction, whereas for transit search, they may change direction but not erratically to search for food (Wilson et al., 2014).As a result, direct flight and transit search were defined as observed not-foraging behaviour while an active search was defined as observed foraging behaviour.These behavioural data are used as the validation data in the study.2 provides an example of visual tracks for the two breeding seasons.The data combined both complete and incomplete tracks of terns.The track of terns was considered complete if individual terns were tracked leaving and returning to the colony.Incomplete tracks were terns that could not be successfully followed back to the colony.
Reasons for incomplete tracks could be individuals flying faster than the boat could follow, flying over a physical obstruction that prevented the boat from following, observers confusing the tracked individuals with other terns, or insufficient fuel in the boat (Perrow et al., 2011).
Visual tracks for which (i) a single observed behaviour was recorded throughout the tracking trip and (ii) total tracking time that did not exceed 1 min are omitted (see Table S1 for a summary of visual tracks used, Supporting information).GPS coordinates of the boat are subsequently converted into step length (km) and turning angle (radians).These calculated metrics potentially provide information about tern behaviour.For example, foraging behavioural activities are typically characterised by slow and tortuous flight, indicating smaller step lengths and low directional persistence in turnings.In contrast, not-foraging behavioural activities are generally characterised by longer step lengths and high directional persistence in turnings (Morales et al., 2004).

| Visual tracks as a proxy for tern tracks
Given that the boat followed at a distance c. given by: where We compare (i) boat tracks and approximate tern tracks and (ii) the distribution of step length and angles corresponding to the boats and approximate tern tracks to determine whether the former can be used as an approximation for the movement of individual terns.
We then model the boat and approximated tern tracking data using HMMs to account for the different movement patterns dependent on the (unknown) underlying behavioural states.We then extract the inferred behavioural states from models fitted to both data sets and create a confusion matrix to assess differences and similarities in inferred states.

| Hidden Markov model (HMM)
A HMM (Figure 3) is a time series model with an observed component, X t , driven by an underlying latent component known as the state process, S t .The latter, S t , takes a value on a finite set of N possible values and is assumed to be a first-order Markov chain with the state transi- , which can be univariate or multivariate are conditionally independent given the underlying states (and parameters), and assumed to be regularly spaced in time, t, with the associated observation process HMMs are suitable for fitting to the visual tracking data since observations are collected at a regularly spaced interval, and for each time t, we specify N = 2 discrete states corresponding to foraging S t = 1 and not-foraging S t = 2 .
Each observed data point, X t , is bi-dimensional, consisting of the step length (km), r t and the turning angle (radians), t .At each time t, the distribution of X t is conditional on the current hidden state, S t , such that where r t is modelled from a gamma distribution with parameters mean, , and standard deviation, , that is, r t | S t = j ∼ Gamma j , j , and t is modelled from a von Mises distribution with parameters mean, , and concentration, , that is, t | S t = j ∼ von -Mises j , j .We assume the distributions are independent for each time t, conditional on the underlying state S t .
The corresponding likelihood of a HMM is a function of the following parameters: (i) : (1 × N) vector of initial state distribution given as (1) The general likelihood function of a HMM is then given by where 1 is a column vector of length N with all entries equal to 1.
We estimate these parameters, = { , , , , , }, via maximum likelihood estimation and obtain the most likely state sequence using the Viterbi algorithm (Zucchini et al., 2016).We used the R package momentuHMM (McClintock & Michelot, 2018) for fitting HMMs to the boat tracks since the package is widely used by scientists studying animal movement.We followed the guidance outlined in Michelot et al. (2016) to specify initial starting values for the model parameters.

| HMMs specification and selection
Boat GPS locations were recorded at 1 s intervals and do not have missing data.The completeness of the data means it is possible to use the recorded positions directly without the need to standardise the recording frequency by interpolating in time and space.Seabirds have been shown to vary their behaviour and area use at different breeding stages, travelling further from the colony to rich foraging grounds during incubation and remaining closer to the colony to feed chicks during chick-rearing (Robertson et al., 2014).As behaviour is expected to differ between the two periods, we expect model parameters to differ.We consider different models by varying model F I G U R E 3 Graphical representation of a HMM where S t and X t denotes the state and observed process.

Pooling effect
Covariate effect

State process
Observed process TA B L E 1 HMMs (Models 0-6) fitted to the boat tracking data across study sites during incubation and chick-rearing.Covariate = Euclidean distance of the boat to the study site.
parameters for each tern species at each colony during the two periods, summarised in Table 1.
Model 0, the base model, specifies that the state and observed processes are completely pooled across the individual visual tracks so that the model parameters are assumed to be the same for all individuals.Model 1 assumes a unique transition probability matrix parameter, Γ, for each individual by removing the pooling effect on the state process.Model 2 assumes unique step length parameters for each individual track by removing the pooling effect on the observed process across tracks.The pooling effect on both state and observed process is not included in Model 3.
The Euclidean distance of the boat to the colony was included as an environmental covariate on the state process in Model 4, the observed process in Model 5 and both processes in Model 6.The parameters associated with the observed and state process are pooled across individual tracks for Models 4, 5 and 6 respectively.
For seabird species in the breeding season, distance to colony will typically be a dominant factor in determining movement and spatial distributions at-sea (e.g.Wakefield et al., 2017), because adults need to return to the colony regularly in order to feed their chicks.It is therefore natural to include the Euclidean distance of the boat to the colony as a potential covariate to describe both the probability of transition between states and the distribution of step length and turning angle within each state and the energetic cost of travelling to a particular location from the breeding colony (Wilson et al., 2014).We assume a simple logistic regression model to include the covariate for the state process in a 2-state HMM (Models 4 and 6).Let c denote the covariate for the 2-state HMM, we set where and 0 , 1 corresponds to the intercept and the regression parameter of the covariate respectively.For the observed process (in Models 5 and 6), the covariate is included in the mean of the step length as = exp 0 + 1 c and included in the standard deviation of the step length as = exp 0 + 1 c .Model selection was performed using the Akaike information criterion (AIC) (Burnham & Anderson, 2002).

| Model validation
The validation data consist of the observed behaviours of visually tracked terns.The inferred behavioural states from HMMs and validation data are assumed to be binary classifications: foraging and not-foraging.Common evaluation metrics for binary classification tasks include confusion matrix accuracy, F1-score, area under a ROC curve and logarithmic loss (Hossin & Sulaiman, 2015).We observed an imbalance in the observed behavioural state distribution.In particular, we identified an unbalanced classification for some breeding colonies such as Cemlyn, Isle of May and Leith.Furthermore, we note that the positive class-foraging behavioural state is less prevalent in the study but is of an increased interest as it is this state that helps to identify tern foraging areas.We use the F1-score metric to validate behavioural states of visually tracked terns inferred from HMMs as it aims to address the problem of imbalanced classes.In particular, the F1-score metric corrects for the imbalance using the harmonic mean between the precision and recall.The (i) positive predictive value (PPV or precision), which is the proportion of correct positives identified from all the predicted positives is calculated as and (ii) true positive rate (TPR or recall), which is the proportion of the positives that are predicted correctly is expressed as Using Equations ( 7) and ( 8), the F1-score is calculated as We also report the negative predictive value (NPV), which is the percentage of correct not-foraging behavioural states of all the decoded not-foraging states expressed as Although the F1-score is a good validation metric, it does not account for how close the decoded behavioural state is to the observed behavioural state.However, the logarithmic loss metric, which is based on probability, does account for the uncertainty in the predicted classification (Hossin & Sulaiman, 2015).Thus, we also consider the logarithmic loss for the fitted HMMs to account for the uncertainty of the decoded behavioural state.We use the observed behavioural states at each point, y i , and the predicted probabilities of locally decoded behavioural state, q i , to calculate the logarithmic loss metric as where n is the number of observations.The fitted model with the lowest log-loss value is deemed optimal for this criteria, and we report the F1-score, PPV and TPR corresponding to the optimal HMM for each species and colony site.
In addition to the validation metrics, we (i) compare the observed behavioural data of each tern species to their respective decoded behavioural data obtained from optimal HMMs across breeding colonies, (ii) identify all foraging events from the observed behavioural data (where a foraging event is a bout within which only observed foraging behavioural states are recorded), (iii) note the decoded foraging behaviour from optimal HMMs that falls within the window of each observed foraging event as correctly inferred and (iv) calculate the proportion of observed foraging events where optimal HMMs identify (a) less than 25% (0% exclusive), (b) 25%-49%, (c) 50%-74% and (d) at least 75% of the observed foraging behavioural states.We also obtain the proportion of observed foraging events completely missed from the foraging behavioural states inferred from optimal HMMs (i.e.observed foraging events where the model infers foraging at 0% of the time points).S4).There are strong similarities between the locations (as would be expected given the boats were following the birds) and step length distributions.However, there appear to be more substantial differences with the turning angle distributions (lower panel of columns 2 and 3 in the figures).The latter difference can be explained by the bird making quicker turns compared to the boat, which has smoother turning movements.

| Assessment of visual tracking data as a proxy for tern movement data
We fitted HMMs to both boat and inferred tern location data.
Comparison of HMMs using locations of the boat and inferred locations of the tern are very similar.We observed little difference in the inferred behavioural states when using boat location to approximate the location of the tern.The confusion matrix metrics in Figure 5 indicate that the proportions of true positives and true negatives when comparing behaviours derived from fitting HMMs to boat and inferred tern locations against each other are higher than those of false negatives and false positives.This implies that majority of the time, foraging behavioural states decoded by HMMs from the boat locations correspond to foraging behavioural states decoded by HMMs from approximate tern locations with very few instances of cases where decoded behavioural states differ from both location data.Thus, the boat locations are adequate for HMMs in this case.

| Validating HMM-inferred behavioural states
Reported results are based on HMMs deemed optimal (i.e.HMMs with the lowest log-loss value).Tables 2 and 3 present the summarised results of 2-state HMMs fitted to the visual tracking data during chick-rearing and incubation (see Tables S2-S4 for additional results).
Correctly decoded foraging states relative to total decoded foraging states ranged from 65% to 98% during chick-rearing.We note that correct decoded foraging states relative to total observed foraging states ranged from 70% to 91% except for Arctic terns from Cemlyn and Coquet study sites with 58% and 61% respectively.
Overall, the performance of HMMs as measured by the F1-score   colony during incubation.The overall low performance during this breeding season may be due to the small sample sizes.
Examining the corresponding observed behavioural data for each movement track of the boat, we identified and defined a foraging bout within each track where observed foraging behaviours were recorded as a foraging event.Optimal models correctly identify at least 50% of foraging behaviour within each observed foraging event, most times during chick-rearing (Figure 6).The reverse is, however, the case during incubation (Figure 7).The number of observed foraging events completely missed across study sites (i.e.observed foraging events where the model infers foraging at 0% of the time points) sums to 65, with a median time of 13 s (Figure 8).
The visual tracks coloured with behavioural states (see, e.g. Figure 9) reveal similarity in the inferred and observed behavioural states across time points within visual tracking trips conducted across breeding colonies.Figure 10 provides histograms of the step length and turning angle overlaid with the density curves of the inferred behavioural states for a given track (see Figures S5-S7 for additional tracks).The inferred states assigned to foraging show shorter step lengths and lower directional persistence in turnings than the not-foraging states, which exhibit larger step lengths and high directional persistence in turnings.
All models fitted appeared to have similar inferred states so that the inferred states were largely insensitive to the set of models considered.However, AIC identified the same, relatively complex model (e.g. an HMM with a relatively large number of model parameters) across many species and breeding colonies, while the validation metrics identified much simpler models.
During incubation, we observe that the HMM accounting for the Euclidean distance of the boat to the colony as a covariate effect is mostly considered optimal compared to the chick-rearing period.
Furthermore, since there are no young terns to look after at the colony during incubation, terns are likely to forage further from the colony during this period compared to chick-rearing.Thus, accounting for the distance of the terns to the colony in HMMs may provide better behavioural inference.

| Visual tracking as a tool for validating HMMs
Inferred behavioural states from telemetry data have not been validated in many previous studies due to the difficulty in obtaining concurrent observed behavioural data.However, these inferred behaviours are used in ecology to delineate important areas, such as those used for foraging, and effective conservation planning and management decisions are taken based on the location of these behaviours.Given the current climate and appetite for increasing the number of protected areas on land and sea globally (e.g.protecting 30% of the earth by 2030 target from the UN Biodiversity COP 15), it is crucial to assess the validity of behaviours inferred from HMMs used in identifying the size and location of essential areas to be protected.
In practice, behavioural states of seabirds are mostly inferred from HMMs fitted to telemetry data (Langrock et al., 2012), and our study is the first to infer behavioural states of seabirds from visual tracking data using HMMs.We acknowledge that there may be potential effects of the boat following the seabirds on their behaviour, the inferred states and the validation process itself.However, previous studies have shown that the visual tracking method does not unduly affect bird behaviour due to a reasonable distance maintained between the individuals and the boat; moreover, most birds appear to ignore the boat (Perrow et al., 2011;Robertson et al., 2014;Wilson et al., 2014).The distance between the boat and the bird was, however, increased when there was a noticeable change in behaviour, such as evasive flight, observed for a few birds (Perrow et al., 2011, Robertson et al., 2014, Wilson et al., 2014).These previous studies did not states of terns were inferred from HMMs fitted to the boat tracks and the corresponding actual (estimated) tern location data.We acknowledge that the boat and approximate tern position were compared for a small number of tern species and restricted to a single colony (Coquet) and breeding period (chick-rearing), which may impose limitations on the how representative the data and how generalised the interpretation of the results can be.However, there are previous studies where individual terns were tracked visually using a boat with tracks obtained from the onboard GPS as proxies for foraging tracks have been used to successfully identify foraging behaviours and areas of tern species (Perrow et al., 2011, Wilson et al., 2014).
The unique approach of the visual tracking method provides telemetry data for the boat, a proxy for the tracks of the terns they are following and additional behavioural observation data, which are difficult to access in terrains such as the marine environment.
Consequently, it allows inferred behaviours of seabirds to be validated using behavioural observations.

| Validating inferred behavioural states
Our study investigated the accuracy of HMMs fitted to visual tracking data from different tern species across breeding colonies in the United Kingdom during the breeding season, using behavioural observation data recorded by observers on the boats.
Results suggest that HMMs can correctly infer behavioural states from tracking data.A similar observation has been shown for inferred behavioural states from HMMs using additional accelerometer and magnetometer data from four species of albatross Conners et al. (2021) and fishermen's movement data with frequency differing from the observed behaviours (Joo et al., 2013).
These methods used to infer behaviours are subject to the accuracy of the measurement devices.Our study is the first to validate HMMs using observed behaviours taken concurrently as the data in the same spatial and temporal context.Generally, HMMs performed reasonably well at decoding behavioural states.
However, the performance during incubation was poor compared  3).Terns on the Isle of May had reduced breeding success in 2010.Therefore, terns that were tracked may have included failed or non-breeders which are not required to return to the colony regularly to attend to eggs or chicks, and so the data for this colony and year may be potentially unrepresentative of breeding adults (Wilson et al., 2014).
The capacity of HMMs in identifying and capturing most foraging behavioural activities within a foraging bout was low for roseate terns at Blue Circle and common terns at Leith during incubation in 2010 (Figure 6) and common terns at Cemlyn and Leith during chick-rearing (Figure 7).The visual tracking method was aimed at chick-rearing (2009)(2010)(2011) but was extended to incubation in 2010, resulting in a reduction in the frequency of data collection (through survey effort being split between time periods) (Wilson et al., 2014), which may be a potential reason for the poor performance of fitted HMMs during incubation and chick-rearing in 2010.Observed behavioural data showed that common terns at Leith colony foraged closer to the colony during chick-rearing, 2010 (Figure 2b).The Leith common tern colony is in a port, so there may have been speed restrictions on the boat and limitations to how well the boat could closely replicate the movement of the terns.It is unclear from our study the exact reason why fitted HMMs did not identify most foraging behavioural states of common terns within foraging events at Leith and Cemlyn.However, overall, 70% (Leith) and 81% (Cemlyn) of the foraging behavioural states were decoded correctly from HMMs.HMMs inferred foraging behavioural states 0% of the time for some observed foraging events that lasted for an average of 21 s.
These missed foraging events were most common in chick-rearing.
Terns forage close to the colony during chick-rearing and do not travel for long distances (as they do in incubation) (Eglington & Perrow, 2014).Also, observers noted short sessions of foraging behavioural activities of some tracked terns in some colonies (JNCC personal communication).As a result, the track of the boat may not capture tern movement corresponding to these short observed foraging events.Consequently, boat tracks may not have represented the tern's track correctly within those short phases of foraging events.As such, the HMMs fitted to boat tracks from such a scenario could not have decoded foraging states within the foraging bout from the boat tracks.
The choice of the number of behavioural states to fit in HMMs is a major challenge in animal movement modelling particularly when the goal is to infer behavioural states from telemetry data.
AIC tends to select HMMs with more states but may not correspond to or have a meaningful biological interpretation of the studied animal.Pohle et al. (2017) provides practical guides in selecting the number of states to fit HMMs.Given a fixed number of states, an additional model selection process may include covariates or consider pooling across individual tracks.However, within our study, these different models did not lead to any substantial differences between the inferred behavioural states, as identified by McClintock (2021).Therefore, fitting less complex HMMs may likely outperform complex models in inferring hidden behavioural states from movement data.As such, when behavioural inference is the study's goal, it may be preferable to consider simpler models (i.e.including a smaller number of model parameters) when choosing an appropriate HMM to fit after selecting the desired number of states.
Our findings have relevance to conservation management and planning.Seabird colonies are more likely to be included as part of protected area networks due to their aggregated nature and relative ease of delineation than areas used by seabirds at sea, especially for species with large foraging ranges from the colony.Foraging areas are considered important habitats to include within seabirdprotected area networks (Lascelles et al., 2016).Thus, foraging behavioural activities can be a focus for future studies looking at using behavioural states to inform conservation and management, such as identifying the optimal size and location of foraging areas around seabird colonies.In addition, our study could be extended to assess how temporal validation translates to spatial validation.The visual tracking data could be used to compare the spatial distribution of behaviours inferred from HMMs with the spatial distribution of observed behaviours to determine the accuracy of foraging areas detected using HMMs with real-world implications for conservation and management.
Our study provides evidence, in the context of UK tern populations, that using HMMs to infer foraging behavioural states can help identify most foraging events correctly, and it would be valuable to explore in future work whether this result can be generalised to other taxa and geographical areas.Additionally, we provide evidence that when a 2-state HMM is implemented (interpreted for the UK terns as foraging and not foraging) most foraging and non-foraging events are identified correctly by the chosen model.Future work can be explored to determine whether this would hold true for three or more states for UK terns, and whether the result could be generalised to other species and taxa in three or more state HMMs.Missed foraging events or bouts may be less frequent from HMMs fitted to telemetry data of seabirds as GPS devices attached to seabirds are more likely to capture movement patterns influenced by short foraging behavioural activities that last a short time than HMMs fitted to visual tracking data.Therefore, using HMMs for behavioural inference, particularly the foraging behaviour of seabirds, can aid spatial planning and inform conservation decisions, hence providing a tool for the effective management of the impact of human activities on seabirds and other species.
In summary, using HMMs to infer important conservationrelevant behaviours from telemetry data appears defensible for UK tern populations based on our results, suggesting that they can be used to inform the design of designated protected areas.A priority for future research would be to explore the extent to which these results are generalisable.Understanding how generalisable the results from this study are to other taxa or species would require appropriate validation data which are collected contemporaneously with tracking information on individuals.We therefore recommend that observational behaviour data that can be used as a 'gold standard' for validation should be collected, wherever feasible, alongside GPS data.Lastly, there is evidence from our validation study that given the same number of behavioural states, there may be no substantial differences in the performance of simpler and complex HMMs in inferring behavioural states even in situations where standard model selection approaches, such as AIC, strongly suggest the use of more complex models.

Rebecca
fulmar [Fulmarus glacialis], little tern [Sternula albifrons], European shag [Phalacrocorax aristotelis], Arctic skua [Stercorarius parasiticus] breed in coastal areas in the north and west of the United Kingdom, with 80% occurring in Shetland, Orkney and the Outer Hebrides.Common terns have a widespread coastal distribution around the United Kingdom and also nest in small colonies inland along rivers and islets.Sandwich terns congregate in several large colonies, and most roseate terns breed on Rockabill, Ireland, summarise the protocol as follows: The visual tracking of terns was conducted during chick-rearing (June and July) and incubation (early May to mid-June) between 2009 and 2011.Rigid hull inflatable boats used for the visual tracking were operated by different skippers across the study sites.The boats were kept c. 50-200 m from terns while an individual was tracked to avoid disturbing the birds and affecting their behaviour.Longitude and latitude of the boats were recorded using an onboard GPS device set to a 1 s sampling frequency.Individuals were tracked on return foraging trips from their breeding colony.One observer maintained constant sight of the tracked individual, while another recorded behavioural information.
50-200 m from the tracked terns, we investigate how well boat tracks replicate the movement of tracked individuals using additional information on the animal's recorded position in relation to the boat.For a subset of tracks recorded at the Coquet Island colony during the chick-rearing in 2009, additional data were also collected corresponding to the distance and bearing of the tern from the boat, thus permitting the reconstruction of the (approximate) longitude and latitude location of the tern.Mathematically, let Lon boat and Lat boat denote the boat's longitude and latitude position and the bearing and distance of the boat to the tern be indicated by 'bearing' and 'distance' respectively.Then the corresponding tern longitude and latitude (Lon tern , Lat tern ) are tion probabilities given as (iii) P X t : (N × N) diagonal matrix corresponding to the observation process given as Lat tern ≈ arcsin sin Lat boat × cos(distance∕R) + cos Lat boat × sin(distance∕R) × cos(bearing) , Lon tern ≈ Lon boat + atan2(y, x), R = 6371 km(radius of the earth), y = sin(bearing) × sin(distance∕R) × cos Lat boat , and x = cos(distance∕R) − sin Lat boat × sin Lat tern .
Visual tracks of common terns coloured with observed behavioural states during (a) incubation and (b) chick-rearing period from Leith dock, 2010.
PPV = number of true positive number of true positive + number of false positive (8) TPR = number of true positive number of true positive + number of false negative (9) F1 -score = 2 PPV * TPR PPV + TPR (10) NPV = number of true negative number of true negative + number of false negative Figures S1-S4).There are strong similarities between the locations Validation results of 2-state HMMs fitted to visual tracking data of terns during chick-rearing.
investigate the extent to which boat-based tracks replicate the path taken by the birds.Our study shows that movement data from the boat being used to visually track terns closely replicated those from the estimated location data of the terns being tracked, particularly for movement tracks corresponding to the foraging behavioural states of terns.Additionally, similar behavioural F I G U R E 6 Proportion of correctly inferred foraging states within each observed foraging event across the study sites during chickrearing.

F
Observed foraging events completely missed from inferred foraging events across the study sites during chick-rearing (CR) and incubation (IN).A, Arctic; C, Common; S, Sandwich tern. to chick-rearing, particularly for Arctic terns at the Isle of May (42% see Table

F
Visual tracks of six common terns coloured with (a) observed and (b) decoded behavioural states from South Shian colony during the chick-rearing period, 2011.F I G U R E 1 0 Histograms showing the distribution of (a) step length (km) and (b) turning angle of six visually (boat) tracked common terns from South Shian colony during chick-rearing period, 2011.Lines represent HMM-fitted state-dependent distributions coloured according to the decoded behavioural states.
TA B L E 3Note: Γ = transition probability matrix, covariate = Euclidean distance of boat to colony.